DataVizAssignment_Ziqi_3.0

A short description of the post.

Ziqi Wang https://example.com/norajones
04-11-2021

1.0 Introduction

This report details the preliminary visualization preparation for a project undertaken for the course ISSS608 - Visual Analytics and Applications offered in SMU MITB. As one of the project deliverables, this assiganment is a sub-module (module 1) from our proposed project - Enabling optimization of Bike Sharing Operations – Bluebikes.

2.0 Literature Review

Based on Shayini, S. (2017, March 24). RPubs - Bike Sharing Data Analysis with R. RPubs. https://www.rpubs.com/shayini/bike_sharing, the analysis digs into data exploration aspect, we include geospatial data mapping in our project to enrich the data visualization. According to Low, C. T. (2018, September 5). Analysis and Visualization of Blue Bikes Sharing in Boston. DataScience+. https://datascienceplus.com/blue-bikes-sharing-in-boston/, the analysis and visualization were performed with mapping using leaflet. Specifically, dock stations are marked in map as well as trips in each dock station are present through heatmap. For dock stations distribution, popups function are used to show the name of the dock station by clicking, to enhance the user experience, we use the htmltool package to enable the label display by hovering house over a marker. As regard to the heatmap, the overall popularity of each dock station can be acquired, however we cannot gain the detailed trip number or station information from it. To narrow this gap, we intend to use dot density to represent the popularity of each dock station, and increase the interactivity by using cluster, on top of that, we add filters to allow us to investigate the usage of different type of users along one day.

3.0 Data Preparation

3.1 Install Packages

Packages required

library(data.table) # for faster loading on large dataset

library(tidyverse) # load ggplot, dplyr

library(leaflet) # interactive mapping

library(leaflet.extras) # extra mapping for leaflet

library(lubridate) # formatting date and time

library(gridExtra) # multiple plot arrange

library(grid) # multiple plot arrange

library(psych) # to get describe function

library(janitor) # to get tabulate variables function

library(htmltool) # for HTML generation and output

library(crosstalk) # for filter inputs

packages = c('data.table','tidyverse','leaflet','leaflet.extras','lubridate','gridExtra','grid','psych','janitor', 'htmltools','crosstalk')
for (p in packages){
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p,character.only= T )
}

3.2 Importing Data

#load trip data
trip <- read_csv('data/202001-bluebikes-tripdata.csv')
#load station data
station <- read.csv('data/current_bluebikes_stations.csv')

3.3 Checking the overview of the dataset

glimpse(trip)
Rows: 128,598
Columns: 15
$ tripduration              <dbl> 478, 363, 284, 193, 428, 695, 1...
$ starttime                 <dttm> 2020-01-01 00:04:05, 2020-01-0...
$ stoptime                  <dttm> 2020-01-01 00:12:04, 2020-01-0...
$ `start station id`        <dbl> 366, 219, 219, 396, 60, 372, 36...
$ `start station name`      <chr> "Broadway T Stop", "Boston East...
$ `start station latitude`  <dbl> 42.34278, 42.37331, 42.37331, 4...
$ `start station longitude` <dbl> -71.05747, -71.04102, -71.04102...
$ `end station id`          <dbl> 93, 212, 212, 387, 49, 178, 23,...
$ `end station name`        <chr> "JFK/UMass T Stop", "Maverick S...
$ `end station latitude`    <dbl> 42.32034, 42.36884, 42.36884, 4...
$ `end station longitude`   <dbl> -71.05118, -71.03978, -71.03978...
$ bikeid                    <dbl> 6005, 3168, 3985, 2692, 4978, 5...
$ usertype                  <chr> "Customer", "Subscriber", "Subs...
$ `birth year`              <dbl> 1969, 2000, 2001, 1978, 1987, 1...
$ gender                    <dbl> 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0...

3.4 Data Wrangling

# Convert second to minutes and round it non decimal points
trip$minutes <- round(trip$tripduration/60, 0)

# Calculate the age from birth year
trip$age <- 2020 - trip$'birth year'

# Recode gender variables
trip$gender <- as.factor(recode(trip$gender, '0' = 'Female', '1' = 'Male', '2' = 'Prefer not to say'))

# Convert the time and date variable to correct format
trip$starttime <- ymd_hms(trip$starttime)
trip$stoptime <- ymd_hms(trip$stoptime)

# Separate date and time
trip <- trip %>%
  mutate_at(vars(starttime), ymd_hms) %>%
  mutate_at(vars(starttime), funs('start_date' = date(.)))

trip <- trip %>%
  mutate_at(vars(stoptime), ymd_hms) %>%
  mutate_at(vars(stoptime), funs('stop_date' = date(.)))

# Extracting day, weekday and hour
trip$day <- day(trip$starttime)
trip$weekday <- wday(trip$starttime, label = TRUE)
trip$hour <- hour(trip$starttime)


# Select relevant variable
trip_data <- select(trip, 'tripduration', 'start_date', 'stop_date', 'start station name', 'start station latitude', 'start station longitude', 'end station name' , 'end station latitude', 'end station longitude', 'usertype', 'gender', 'minutes', 'age', 'day', 'weekday', 'hour')

# Rename variable
colnames(trip_data) <- c('tripduration', 'start_date', 'stop_date', 'start_name', 'start_lat', 'start_long', 'end_name' ,'end_lat', 'end_long', 'usertype', 'gender', 'minutes', 'age', 'day', 'weekday', 'hour')

# Convert into factor
trip_data$weekday <- as.factor(trip_data$weekday)

3.5 Checking the overview of the dataset again

glimpse(trip_data)
Rows: 128,598
Columns: 16
$ tripduration <dbl> 478, 363, 284, 193, 428, 695, 1336, 1329, 12...
$ start_date   <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01...
$ stop_date    <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01...
$ start_name   <chr> "Broadway T Stop", "Boston East - 126 Border...
$ start_lat    <dbl> 42.34278, 42.37331, 42.37331, 42.40933, 42.3...
$ start_long   <dbl> -71.05747, -71.04102, -71.04102, -71.06382, ...
$ end_name     <chr> "JFK/UMass T Stop", "Maverick Square - Lewis...
$ end_lat      <dbl> 42.32034, 42.36884, 42.36884, 42.40986, 42.3...
$ end_long     <dbl> -71.05118, -71.03978, -71.03978, -71.06632, ...
$ usertype     <chr> "Customer", "Subscriber", "Subscriber", "Sub...
$ gender       <fct> Female, Male, Male, Male, Male, Male, Female...
$ minutes      <dbl> 8, 6, 5, 3, 7, 12, 22, 22, 22, 21, 21, 20, 1...
$ age          <dbl> 51, 20, 19, 42, 33, 31, 51, 51, 51, 51, 51, ...
$ day          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ weekday      <ord> Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed,...
$ hour         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

4.0 Visualisation Preparation

4.1 Visualize all the dock stations around Boston

Before the analysis, we first visualize all the dock stations around Boston, and the dock station name will be displayed while hovering over the mouse.

DockStation <- station %>% 
  leaflet() %>%
  setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>% 
  addTiles() %>%
  addMarkers(lng = station$Longitude, 
             lat = station$Latitude, 
             label = ~htmlEscape(station$Name))
DockStation

4.2 Dot Density Map with Filter

To shrink the data size and select the data according to our analyzing purpose, we add filters to the dot density map, with that we are able to visualize the changes in number of trip for each station through the day as well as difference between two types of user.

# Make mapping color palette for numerical variables
popularity<-tabyl(trip_data, start_name)
cPal<-colorNumeric(palette = "YlOrRd", domain = popularity$n)
#starting location
#mapping
shared_trip<-SharedData$new(trip_data)
bscols(widths=c(3,9),
  list(
    filter_checkbox("usertype", "User Type", shared_trip, ~usertype, columns = 1),
    filter_slider("hour", "Hour of Day", shared_trip, ~hour, step=1)
  ),
  leaflet(shared_trip) %>%
  setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>% 
  addTiles() %>%
  addCircleMarkers(lng = trip_data$start_long, 
                   lat = trip_data$start_lat,
                   color = cPal(popularity$n),
                   stroke = FALSE,
                   radius = sqrt(popularity$n)/5,
                   fillOpacity = 1,
                   popup = paste('Station:',trip_data$start_name, '<br>',
                           'Number of trips:',popularity$n)) %>%
  addLegend("bottomright", pal = cPal, values = ~popularity$n,
      title = "Number of Trip",
      opacity = 1)
)

5.0 Storyboard